[GPU]qwen3 moe fused compressed #32536
Conversation
Force-pushed from f35b2cb to 4ccdcf1
Force-pushed from faa6533 to 836d35c
Pull Request Overview
This PR adds GPU support for Qwen3 MoE (Mixture of Experts) models with fused compressed weight optimization. The implementation introduces a transformation pipeline that converts standard MoE operations to compressed format and fuses routing operations (softmax/topk/onehot) into the MoE computation for improved performance.
Key Changes:
- New transformation passes: `FuseVectorizedMOE3GEMM` → `ConvertMOEToMOECompressed` → `FuseMOECompressed`
- Dual execution strategy: GEMM kernels for the prefill stage, OCL kernels for the decode stage
- Memory optimization through weight compression and operation fusion
Reviewed Changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| transformations_pipeline.cpp | Registers new MOE transformation passes in the GPU plugin pipeline |
| moe_opt.cpp/hpp | Implements optimized MOE execution with oneDNN and custom OCL kernels |
| moe_compressed.cpp/hpp | Defines base MOECompressed operation with compressed weight configuration |
| moe_fused_compressed.cpp/hpp | Defines MOEFusedCompressed that includes fused routing operations |
| convert_moe_to_compressed.cpp/hpp | Transformation to convert standard MOE to compressed weight format |
| fuse_moe_compressed.cpp/hpp | Transformation to fuse routing subgraph into MOE operation |
| keep_moe_const_precision.cpp/hpp | Prevents precision conversion of compressed weights and zero points |
| moe_opt.cl, moe_mlp.cl | OpenCL kernels for softmax_topk, gather, scatter, and MLP operations |
| paged_attention_opt.cpp | Adds workaround for OCL resource issue with small input tokens |
Comments suppressed due to low confidence (1)
src/plugins/intel_gpu/src/graph/impls/ocl_v2/moe_opt.cpp:1
- Remove commented-out unused code rather than leaving it in the codebase.
// Copyright (C) 2025 Intel Corporation
Force-pushed from 0fd5af0 to 827a9f6
Force-pushed from b0b0841 to 001bed4
Force-pushed from 358a015 to c824b67
Force-pushed from c824b67 to 79d6a13
```cpp
///          shape [num_experts, hidden_size, group_num, 1]
///      10: w2_zp - expert zp for final projection for compressed experts,
///          shape [num_experts, hidden_size, group_num, 1]
/// \param config Configuration for the MOE operation
```
This description applies only to the 3gemm_Swiglu type. Please mention that.
done!
```cpp
auto topk_shape = pattern_map.at(topk_m).get_partial_shape();
OPENVINO_ASSERT(topk_shape[1].is_static(), "k dimension in moe topk input should be static.");
config.top_k = topk_shape.back();
config.out_type = ov::element::dynamic;
```
Please use OPENVINO_THROW for important checks.
done!
```cpp
class FuseMOECompressed : public ov::pass::MatcherPass {
public:
    OPENVINO_MATCHER_PASS_RTTI("FuseMOECompressed");
    FuseMOECompressed();
```
This naming is also too general, but the pass targets the Gemm3 pattern. Please rename it as well to reduce confusion.
done!
```cpp
TEST(moe_compressed_gpu, moe_accuracy_test) {
    auto& engine = get_test_engine();
    if (!engine.get_device_info().supports_immad) {
        std::cout << "not support immad, skip test" << std::endl;
```
Please remove debug print.
done!
```cpp
struct moe_fused_compressed : public primitive_base<moe_fused_compressed> {
    CLDNN_DECLARE_PRIMITIVE(moe_fused_compressed)

    moe_fused_compressed() : primitive_base("", {}) {}
```
Please modify the primitive name too, to reflect the specific target pattern.
done!
```cpp
namespace details {}

template <>
struct typed_program_node<moe_fused_compressed> : public typed_program_node_base<moe_fused_compressed> {
```
The file name moe_inst.h is too general. Please rename all the relevant primitives, test names, inst, and node names to moe_fused_3gemm_swiglu.
done!
I checked that there is no impact on gpt-oss. Only minor comments were added.